We study the problem of analyzing the influence of various factors
affecting individual messages posted in social media.
The problem is challenging because of various types of influ- 
ences propagating through the social media network that act 
simultaneously on any user. Additionally, the topic composi- 
tion of the influencing factors and the susceptibility of users 
to these influences evolve over time. This problem has not been
studied before, and off-the-shelf models are unsuitable for
this purpose. To capture the complex interplay of these var- 
ious factors, we propose a new non-parametric model called 
the Dynamic Multi-Relational Chinese Restaurant Process. 
This accounts for the user network for data generation and 
also allows the parameters to evolve over time. Designing 
inference algorithms for this model suited for large scale 
social-media data is another challenge. To this end, we 
propose a scalable and multi-threaded inference algorithm 
based on online Gibbs Sampling. Extensive evaluations on 
large-scale Twitter and Facebook data show that the ex- 
tracted topics when applied to authorship and comment- 
ing prediction outperform state-of-the-art baselines. More 
importantly, our model produces valuable insights on topic 
trends and user personality trends, beyond the capability of 
existing approaches. 

1. INTRODUCTION 

Social networking sites, such as Twitter, Facebook and MySpace,
have proven to be extremely popular platforms for
users to share views and opinions using short posts. Understanding
and analyzing topics in social media has become
immensely important for a variety of stakeholders, such as 
companies advertising products and identifying customer 
segments, social scientists and national security agencies, 
leading to a surge in research interest [15, 12, 14, 21, 1, 17,
23]. There are two major distinguishing features of social
media data. First, users are influenced by a variety of fac- 
tors when posting messages. The four major factors have 
been identified to be personal preferences of the users, their 
immediate network of friends on the network, geographic or 
regional issues and events and world-wide happenings [23].
While all factors typically affect all users, different users 
have different 'personalities', in that they are influenced by 
these factors in different degrees. Secondly, social media 
data is inherently dynamic. Topics follow different 'trends'; 
individual interests of influential users, or issues starting off 
within a small network of friends, sometimes lead to global 
upheavals, while others enjoying global popularity are slowly
relegated to individual favorites. Similarly, user personali- 
ties also evolve and show trends over a geography or sub- 
network. 

Owing to this multitude of factors, and the intrinsic interplay
between them, analysis of social media data has been
a major challenge. Most existing approaches fall short of ad- 
dressing the problem in its entirety, and only model isolated 
factors and their interactions [17, 21, 1]. A major hurdle for
sophisticated models is the scale of the data; the associated 
inference algorithms need to be scalable and efficient. 

In this paper, we propose a non-parametric probabilistic 
approach for analyzing social media data. Specifically, we 
first propose an augmentation of the Chinese Restaurant 
Process [16], called the Multi-Relational Chinese Restaurant
Process (MRelCRP), that accommodates users and multi- 
ple relationships over them, for assigning topics to posts. By 
using relationships, the MRelCRP defines a new and different
family of distributions compared to the traditional non-parametric
processes such as the Dirichlet Process [4], and
its hierarchical versions [20]. We further propose a dynamic 
version of the MRelCRP (D-MRelCRP) that allows tempo- 
ral evolution of the model parameters to capture topic and 
personality trends. The rich interactions of various parame- 
ters in the model are able to capture the various interplays 
in social media data. Crucially, we propose an efficient and 
multi-threaded algorithm, based on online collapsed Gibbs 
sampling, for performing learning and inference for Dynamic 
MRelCRPs. 

We evaluate the proposed model on two large scale datasets. 
The first dataset consists of 360 million posts from Twitter. 
The second dataset consists of 300K posts from Facebook. 
We demonstrate both qualitatively and quantitatively the 
goodness of the topics discovered by our model. When em- 
ployed for predicting authorship and user activity, models 
using these topics significantly outperform state-of-the-art 
baselines. More importantly, our model is able to discover 
interesting and insightful topic and personality trends. For 
example, our analysis shows that users' posts are mostly influenced
by personal preferences, rather than global, regional
or social-network factors, except in times of major world
events, when users become swayed by global influences at 
the cost of personal preferences. We are not aware of any 
existing model that can perform such a wide array of anal- 
yses effectively on social media data. 

The rest of the paper is structured as follows. We discuss 
related work on social media analysis and topic models in 
Section 2. We describe our proposed model in Section 3 and
the associated inference algorithm in Section 4. Experimental
results are presented in Section 5, and we conclude in
Section 6.

2. RELATED WORK 

Here, we discuss our contributions in the light of related 
work in non-parametric probabilistic modeling and social 
media analysis. 

Non-parametric models: The Dirichlet Process (DP) 
[4] is a prior over a countably infinite set of atoms, and is 
popularly used as a prior for mixture models (DP Mixture 
Model) in applications, where the number of clusters is dif- 
ficult to provide as a parameter. The Chinese Restaurant 
Process [16] provides a generative description for the Dirichlet
Process, and is useful for designing sampling algorithms
for DP mixture models. The distributions defined by these 
models are exchangeable, in that different permutations of 
the data are equally probable. 

CRPs have been extended to handle distances and rela- 
tions. The distance dependent CRP (DD-CRP) [7] takes 
into account a distance matrix over the input data points. 
Unlike the DP-HDP family, this results in a distribution that 
is not exchangeable, which is a feature of many applications. 
In comparison, the RelCRP uses an additional non-unique 
label for each data point, and a general graph defined over 
them. The resultant distribution is exchangeable. As such, 
the DD-CRP and the RelCRP define different families of 
distributions, and one cannot be represented by the other. 
There is a different body of work that uses CRPs for modeling
relations [13] and their dynamic evolution. These are
unsuitable for our current application, where we do topical
analysis of the data points, based on relations between their 
(user) labels. 

Many applications require multiple coupled Dirichlet Pro- 
cesses. The Hierarchical Dirichlet Process (HDP) [20] is one 
way to introduce coupling using a two level structure. The 
HDP can be useful, for example, for extending the popular 
Latent Dirichlet Allocation (LDA) model [8] to a countably
infinite number of topics [1]. The HDP can be equivalently
represented by an extension of the CRP called the Chinese
Restaurant Franchise (CRF) [20]. Just as the CRF introduces
coupling between CRPs, the MRelCRP introduces
coupling between RelCRPs. However, the nature of the coupling
in the MRelCRP can be much richer, depending on
the relationships, as we explain in Section 3.2.

Temporal evolution has been addressed in the context of 
non-parametric models [2, 3, 1], where the parameters of
the corresponding static model become functions of time.
Some of the approaches are amenable to scalable inference, 
while others are not. For the Dynamic MRelCRP, we use the 
dynamic evolution of the parameters proposed in the con- 
text of Recurrent CRF [3] [1], because of the scalability of 
the associated inference problem. Note, however, that the 
similarity between the Recurrent CRF and the Dynamic- 
MRelCRP is only in the temporal evolution of model pa- 
rameters. The static model is a HDP/CRF, as compared to 
the RelCRP in our case. 

Social Media Analysis: There has been a surge of lit- 
erature on problems involving social media content. Work 
has mostly been focused around (a) Content analysis on mi- 
croblogs, (b) Inferring user interests and (c) Mining patterns 
of variation on social media, as we discuss below. 

(a) Most content analysis papers [10] use standard topic
models such as LDA [8] or basic metrics like tf-idf. Focusing
on the specific content of microblogs, Ramage et al. [17] proposed
an LDA variant that accounts for hashtags in content
analysis. One problem with this approach is that hashtags
are not general features of social media data, and are often
unreliable. There is little modeling work that takes into
account the rich features of social media such as network, 
geography, etc. 

(b) In the context of microblogging sites, content recommendation
approaches [15, 12, 14] assess user interests
based on their activities. Recently, Wen et al. [21] have proposed
an approach which studies the influence of the network
on users. Ahmed et al. [1] model the dynamics of user interest
and also account for the generic popularity of a particular
item, but do not consider the influence of various external
factors like network of users and geography. Thus, most
related work deals either with the influence of a single factor
or with user preferences.

(c) Yang et al. [23] made one of the first attempts at
understanding the temporal evolution of patterns on social 
networking sites like Twitter. Apart from temporal dynam- 
ics, study of such patterns with respect to geography and 
other factors has not been explored for content on social 
networking sites. 

3. MODEL FOR SOCIAL MEDIA 

In this section, we describe our Dynamic Multi-Relational
Chinese Restaurant Process model, which we employ to study 
the interplay of world-wide, geographic, network and user 
specific factors, and their dynamics, in social media. We 
build up our model in steps, first describing the static Rela- 
tional Chinese Restaurant Process, then incorporating mul- 
tiple relations, and finally adding temporal evolution to it. 
In our application, the basic task is to associate topics with 
individual posts or tweets. The topics correspond to con- 
cepts such as 'movies', 'sports', 'politics' etc. Unlike topic 
models such as LDA [8], which associate a distribution over
topics with each document, we assume that each post, con- 
sidering its shortness, corresponds to exactly one topic. This 
makes the model simpler and the associated inference algo- 
rithm more efficient and scalable. 

3.1 Relational Chinese Restaurant Process 

The Dirichlet Process [4] has become a popular non-parametric
prior in clustering applications, where the number of clusters
need not be specified a priori, but instead grows
with the data size. The Chinese Restaurant Process (CRP)
[16] provides a fanciful description of the Dirichlet Process,
by imagining data points as customers being seated at tables,
which represent clusters, as they enter the restaurant.
Let $w_i$ denote the $i$-th data point, or post in our case, and $z_i$
denote the cluster (or table assignment) for the post. Then,
given the assignments $z_{1:(i-1)}$ of the first $i-1$ customers to $K$
tables, the conditional distribution for the table assignment
of the $i$-th customer is given by the CRP as follows:

$$P(z_i = k \mid z_{1:(i-1)}, \alpha) \propto \begin{cases} n_k & k \le K \\ \alpha & k = K+1 \end{cases} \tag{1}$$

where $n_k$ is the number of customers already assigned to
table $k$. Note that this is a 'rich gets richer' model, where
tables with more customers have a higher probability of getting
new customers, but new tables also have a non-zero
probability (proportional to $\alpha$) of getting customers.
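As a small illustration of Equation (1), the following Python sketch normalizes the CRP weights into seating probabilities. The function name `crp_table_probs` and the `"new"` key for the unopened table are illustrative choices, not part of the model description.

```python
from collections import Counter

def crp_table_probs(assignments, alpha):
    """Return P(z_i = k | z_{1:i-1}, alpha) for each occupied table and a new one.

    assignments: table ids of the first i-1 customers.
    alpha: concentration parameter (weight of opening a new table).
    The unopened table K+1 is reported under the key "new".
    """
    counts = Counter(assignments)        # n_k for every occupied table
    total = len(assignments) + alpha     # normalizing constant
    probs = {k: n / total for k, n in counts.items()}
    probs["new"] = alpha / total
    return probs

# Five customers already seated at tables 0, 0, 1, 0, 2.
probs = crp_table_probs([0, 0, 1, 0, 2], alpha=1.0)
```

Table 0, which already holds three of the five customers, receives the largest probability, which is the 'rich gets richer' effect.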



When each table $k$ is associated with a (topic) distribution,
with parameters $\phi_k$ drawn iid from an appropriate base
distribution $H$, the CRP can be used as a prior for mixture
distributions. Once the $i$-th customer is seated at a table $z_i$,
the corresponding data item $w_i$ can be drawn independently
from the distribution $\phi_{z_i}$ associated with the table. For a
generative model for posts, each topic distribution can be
a multinomial $\mathrm{Mult}(\phi_k)$ over the post vocabulary, so that
each word $w_{ij}$ of the post is generated independently from
that topic, and the base distribution $H$ can be chosen to be
a Dirichlet $\mathrm{Dir}(\beta)$, since it is conjugate to the multinomial.

Though defined as a sequential process, the CRP mixture
model can be easily shown to be exchangeable, which
means that all permutations of observed data $\{w_i\}$ have the
same probability under the model. The Chinese Restaurant
Process has been widely used in generative models for different
applications [20, 19, 22]. However, it is unsuitable
for social media data, since it ignores a fundamental aspect:
the social network over users who generate the content.
Specifically, each post has associated with it a user variable
$u_i$, that takes values from a finite set of users $\mathcal{U}$. These
users are further connected by a network of relationships.
To accommodate this, we augment the Chinese Restaurant
Process to handle such relationships.

In the Relational Chinese Restaurant Process (RelCRP),
each customer (data point) is associated with a label $u_i \in \mathcal{U}$.
In the context of social media data, we will refer to each
element in $\mathcal{U}$ as a user, and say that each data point or post
has a user label. In addition, we have a relationship $\mathcal{R}$, such
that each element $r \in \mathcal{R}$ is a subset of $\mathcal{U}$. We can imagine
$\mathcal{R}$ as defining a set of hyper-edges over elements in $\mathcal{U}$. Note
that we do not fix the cardinality of the elements in $\mathcal{R}$. We
will see the need for this shortly. Using $\mathcal{R}$, we can define the
neighbors $N(u, \mathcal{R})$ of an element $u \in \mathcal{U}$ as all other elements
that share a relation with $u$:

$$N(u, \mathcal{R}) = \{u' \in \mathcal{U} : \exists r \in \mathcal{R},\ u \in r,\ u' \in r\}.$$

Given the additional $u_i$ labels and the relationship $\mathcal{R}$, the
conditional distribution of the table assigned to the $i$-th customer
is defined in RelCRP as follows:

$$P(z_i = k \mid z_{1:(i-1)}, u_{1:i}, \mathcal{R}, \alpha) \propto \begin{cases} n_k^{N(\mathcal{R}, u_i)} & k \le K \\ \alpha & k = K+1 \end{cases} \tag{2}$$

where $n_k^{N(\mathcal{R}, u_i)}$ is the number of posts by neighbors of $u_i$ in $\mathcal{R}$ already
assigned to table $k$.
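The RelCRP conditional of Equation (2) can be sketched in Python as follows. The helper names `neighbors` and `relcrp_probs` are hypothetical. The sketch assumes that a user always counts as her own neighbor, which matches the personal-preference example discussed later, and it omits tables that have no neighbor posts since their weight is zero.

```python
from collections import Counter

def neighbors(user, relation):
    """N(user, R): users sharing at least one hyper-edge with `user`.

    Assumption: the user is always included in her own neighbor set.
    """
    result = {user}
    for edge in relation:
        if user in edge:
            result |= set(edge)
    return result

def relcrp_probs(user, past_users, past_tables, relation, alpha):
    """Normalized form of Equation (2) for a new post by `user`."""
    nbrs = neighbors(user, relation)
    # n_k^{N(R, u_i)}: earlier posts by neighbors, grouped by table
    counts = Counter(z for u, z in zip(past_users, past_tables) if u in nbrs)
    total = sum(counts.values()) + alpha
    probs = {k: n / total for k, n in counts.items()}
    probs["new"] = alpha / total
    return probs

friends = [{"a", "b"}]                 # one hyper-edge of cardinality 2
past_users = ["a", "b", "c", "c"]
past_tables = [0, 1, 1, 2]
probs_a = relcrp_probs("a", past_users, past_tables, friends, alpha=1.0)
```

Here the posts by user `c` do not influence user `a`, because `c` shares no hyper-edge with `a`.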

Let us now look at some example uses of the RelCRP in 
the context of social media data. We start from the trivial 
case, where the RelCRP reduces to the CRP, and then move 
on to more interesting ones. 

Influence of World-wide Factors: Very commonly,
users are influenced by globally popular events or entities
when choosing a post topic. For example, users who are
not fans of Michael Jackson tweeted on this topic in the
event of his unexpected death. This can be captured in
the RelCRP by associating a unique label $u_i$ with each data
point, along with a 'complete' relationship $\mathcal{R}_w$, that contains
a single relation (hyper-edge) over all $u \in \mathcal{U}$. In this case,
Equation (2) reduces to:

$$P_w(z_i = k \mid z_{1:(i-1)}, u_{1:i}, \mathcal{R}_w, \alpha) \propto \begin{cases} n_k & k \le K \\ \alpha & k = K+1 \end{cases} \tag{3}$$

where $n_k^{N(\mathcal{R}_w, u_i)} = n_k$ is the number of posts by all users
(which is the neighbor set of $u_i$) already assigned to table
(topic) $k$. Note that this is the same as Equation (1). Thus
the RelCRP is able to recover the traditional CRP, using
unique data labels and a 'complete' relationship.

User's Personal Preferences: One of the most significant
factors influencing the content of a post is the preference
of the associated user. A specific user may be more
interested in 'movies' than in 'sports' or 'politics'. Evidence
of this can be found in the topics of this user's earlier posts:
a user is more likely to post on a topic that she has used
more frequently. To capture this in the RelCRP, we set $u_i$
to be the user identifier, and simply construct an empty relation
$\mathcal{R}_u$ over $\mathcal{U}$. Given $(\mathcal{U}, \mathcal{R}_u)$, Equation (2) reduces to:

$$P_u(z_i = k \mid z_{1:(i-1)}, u_{1:i}, \mathcal{R}_u, \alpha) \propto \begin{cases} n_k^{u_i} & k \le K \\ \alpha & k = K+1 \end{cases} \tag{4}$$

where $n_k^{N(\mathcal{R}_u, u_i)} = n_k^{u_i}$ is the number of posts by user $u_i$
(who is her only neighbor) already assigned to table $k$.
Note that even the case of the empty relation $\mathcal{R}_u$ cannot be
captured by the traditional CRP.

Influence of Friend Network: A user is often influenced
by the post topics of her friends. To capture this,
as before, we set the label $u_i$ of the post to be the user id,
and construct $\mathcal{R}_n$ based on the friendship network: for each
follower or friendship relation between users $u_i$ and $u_j$, we
add a tuple $(u_i, u_j)$ to $\mathcal{R}_n$. Note that in this case all elements
of $\mathcal{R}_n$ have cardinality 2. Given $(\mathcal{U}, \mathcal{R}_n)$,
$P_n(z_i = k \mid z_{1:(i-1)}, u_{1:i}, \mathcal{R}_n, \alpha)$ has the same form as Equation (2),
where $n_k^{N(\mathcal{R}_n, u_i)}$ is now the number of times followees or
friends of user $u_i$ have posted on topic $k$.

Influence of Geography: As a final example, a user's
posts may also be influenced by geographic trends. For instance,
a national election draws a lot of attention from citizens
of that country. This can be captured by the RelCRP,
by again associating labels $u_i$ with user ids, and constructing
$\mathcal{R}_g$ to capture geographic locations: adding a hyper-edge
in $\mathcal{R}_g$ over all users in a specific country. Typically,
the geographic location can be known from the profile of
the user, and we assume such a construction to be possible.
Note that in this case every edge has a different cardinality,
and most will be extremely large. Interestingly,
the RelCRP does not require maintaining the explicit relations,
but only simple statistics over them. Given $(\mathcal{U}, \mathcal{R}_g)$,
$P_g(z_i = k \mid z_{1:(i-1)}, u_{1:i}, \mathcal{R}_g, \alpha)$ again takes the form of Equation (2),
where $n_k^{N(\mathcal{R}_g, u_i)}$ is now the number of times users
in the same geography as user $u_i$ have posted on topic $k$.

Thus, the RelCRP can be used to capture the different 
posting patterns in social media within a single framework, 
in a way that the traditional CRP cannot. Just like the CRP, 
however, the RelCRP can be used to define a mixture model 
by associating a topic with each table. It can be shown that 
the resultant distribution remains exchangeable. 
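The remark that only simple statistics over the relations need to be maintained can be made concrete with a sketch. The class below is hypothetical (its name, the `country_of` and `friends_of` inputs, and the factor codes `"w"`, `"u"`, `"n"`, `"g"` are illustrative assumptions). It keeps three count tables and derives the neighbor counts for all four relationships from them, without materializing the world-wide or geographic hyper-edges.

```python
from collections import defaultdict

class RelationCounts:
    """Per-topic counts for the world, user, friend and geography relations."""

    def __init__(self, country_of, friends_of):
        self.country_of = country_of     # user -> country (assumed from profiles)
        self.friends_of = friends_of     # user -> set of friends or followees
        self.world = defaultdict(int)    # topic -> posts by all users
        self.user = defaultdict(int)     # (user, topic) -> posts by that user
        self.geo = defaultdict(int)      # (country, topic) -> posts from there

    def add_post(self, user, topic):
        self.world[topic] += 1
        self.user[(user, topic)] += 1
        self.geo[(self.country_of[user], topic)] += 1

    def count(self, factor, user, topic):
        """Neighbor count n_k^{N(R_f, u)} for factor f in {'w','u','n','g'}."""
        if factor == "w":
            return self.world[topic]
        if factor == "u":
            return self.user[(user, topic)]
        if factor == "g":
            return self.geo[(self.country_of[user], topic)]
        if factor == "n":
            # Assumption: the user's own posts count alongside her friends'.
            group = self.friends_of.get(user, set()) | {user}
            return sum(self.user[(v, topic)] for v in group)
        raise ValueError(f"unknown factor: {factor}")

rc = RelationCounts({"a": "IN", "b": "IN", "c": "US"}, {"a": {"c"}})
for author, topic in [("a", 0), ("b", 0), ("c", 0), ("c", 1)]:
    rc.add_post(author, topic)
```

Only the friend relationship needs a per-user traversal; the other three counts are constant-time lookups.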

3.2 Multi-Relational CRP 

We have seen that the RelCRP is able to model the in- 
dividual effect of the world-wide factors, user preferences, 
friend network and geographic factors when the labels and 
relationships are appropriately defined. However, in real- 
ity, all of these influences act simultaneously on any user, 
and their interplay determines the content of each of her 
posts. Further, this aggregate influence pattern is user- 
specific. For example, different users are affected differently 
by the same combination of world and geographic events. 



We now present the Multi-Relational Chinese Restaurant 
Process (MRelCRP) that captures such aggregate influences 
using multiple relations defined over the same user labels. 

The MRelCRP is characterized by a set of labels $\mathcal{U}$, along
with $m$ relations $\{\mathcal{R}_j\}_{j=1}^{m}$ defined over $\mathcal{U}$. With the $i$-th
data point (post), we associate an additional variable $f_i$,
which takes values from $\{1 \ldots m\}$, indicating the relationship
that influenced this data point. This depends on the
associated label (user) $u_i$. For each label $u \in \mathcal{U}$, there is an
$m$-dimensional multinomial distribution $\mathrm{Mult}(\pi_u)$. Each $\pi_u$
is assumed to be generated iid from a Dirichlet $\mathrm{Dir}(\alpha)$. We
interpret $\pi_{u,j}$ as the probability of label $u$ being influenced
by the $j$-th relationship $\mathcal{R}_j$, i.e. $P(f_i = j \mid u_i = u)$. We may
imagine $\pi_u$ as reflecting the 'personality' of user $u$. Given
these parameters, and the assignment of the first $i-1$ posts
to $K$ topics, the MRelCRP defines the conditional distribution
of the topic assignment of the $i$-th post with label $u_i$ as
follows:

$$P(z_i = k \mid z_{1:(i-1)}, u_{1:i}, \alpha, \{\mathcal{R}_j\}, \{\pi_u\}) = \sum_{j} \pi_{u_i,j}\, P(z_i = k \mid z_{1:(i-1)}, u_{1:i}, \alpha, \mathcal{R}_j) \tag{5}$$

which is a mixture of $m$ individual RelCRP distributions,
defined according to Equation (2). This can be interpreted
as first selecting a particular RelCRP from a prior distribution
specific to the label $u_i$, and then selecting a table using
the selected RelCRP.
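The mixture in Equation (5) can be illustrated with a short Python sketch. It assumes that each per-relation RelCRP conditional has already been normalized and is given as a dictionary from table id to probability, with the unopened table stored under the key `"new"`; the function name `mrelcrp_probs` is hypothetical.

```python
def mrelcrp_probs(pi_u, relcrp_dists):
    """Mix per-relation RelCRP conditionals with the user's weights pi_u.

    pi_u: list of m mixture weights for the user (sums to 1).
    relcrp_dists: list of m dicts, table id -> probability under relation j.
    """
    mixed = {}
    for weight, dist in zip(pi_u, relcrp_dists):
        for table, p in dist.items():
            mixed[table] = mixed.get(table, 0.0) + weight * p
    return mixed

# A user equally driven by two relations with different neighbor counts.
world_dist = {0: 0.75, "new": 0.25}
own_dist = {1: 0.5, "new": 0.5}
mixed = mrelcrp_probs([0.5, 0.5], [world_dist, own_dist])
```

A table that is popular under only one relation still receives mass in the mixture, scaled by how susceptible the user is to that relation.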

The aggregated influences in the post generation process
can now be captured by the MRelCRP framework, by considering
the set of 4 relationships $\{\mathcal{R}_w, \mathcal{R}_u, \mathcal{R}_n, \mathcal{R}_g\}$. A 4-dimensional
influence factor $\pi_u$ is sampled for each user $u$
from $\mathrm{Dir}(\alpha_w, \alpha_u, \alpha_n, \alpha_g)$. This can be imagined to reflect
the personality of the user. Then, for each post, a topic is
selected for it, in two steps, using Equation (5). Finally,
the individual words in the post are sampled iid from this
selected topic. This is described in Table 1.

Table 1: Generative Process for MRelCRP

1. For each topic $k$
   a. Sample $\phi_k \sim \mathrm{Dir}(\beta)$
2. For each user $u$
   a. Choose $\pi_u \sim \mathrm{Dir}(\alpha_w, \alpha_u, \alpha_n, \alpha_g)$
3. For each post $i$
   a. Choose $f_i \sim \mathrm{Mult}(\pi_{u_i})$
   b. Choose $z_i \sim P(z_i \mid z_{1:(i-1)}, u_{1:i}, \alpha, \mathcal{R}_{f_i})$
   c. For each word $j$ of post $i$
      i. Choose $w_{ij} \sim \mathrm{Mult}(\phi_{z_i})$
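The generative steps of Table 1 can be simulated with the following Python sketch, which uses only the standard library. Several details are assumptions made for illustration: the author of each post is drawn uniformly, topics are drawn lazily when a table is first opened (instead of up front for every topic), relations are passed as precomputed neighbor sets, and all function and parameter names are hypothetical.

```python
import random

def sample_dirichlet(params, rng):
    """Draw from Dir(params) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(weights, rng):
    """Draw an index with probability proportional to non-negative weights."""
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1

def generate_posts(users, neighbor_sets, n_posts, vocab_size, words_per_post,
                   alpha, alpha_dir, beta, seed=0):
    """neighbor_sets[j][u] is the neighbor set N(u, R_j) of user u."""
    rng = random.Random(seed)
    pi = {u: sample_dirichlet(alpha_dir, rng) for u in users}   # step 2
    topics = []                    # phi_k, created when table k is opened
    authors, z, f, posts = [], [], [], []
    for _ in range(n_posts):
        u = rng.choice(users)              # author (assumed uniform)
        f_i = sample_index(pi[u], rng)     # step 3a: influencing relation
        nbrs = neighbor_sets[f_i][u]
        weights = [0.0] * len(topics)      # neighbor counts per topic
        for v, k in zip(authors, z):
            if v in nbrs:
                weights[k] += 1.0
        weights.append(alpha)              # weight of a new topic
        z_i = sample_index(weights, rng)   # step 3b: RelCRP draw
        if z_i == len(topics):             # step 1, done lazily
            topics.append(sample_dirichlet([beta] * vocab_size, rng))
        words = [sample_index(topics[z_i], rng)
                 for _ in range(words_per_post)]                # step 3c
        authors.append(u)
        z.append(z_i)
        f.append(f_i)
        posts.append(words)
    return posts, authors, z, f, pi

users = ["a", "b", "c"]
world = {u: set(users) for u in users}     # complete relationship
own = {u: {u} for u in users}              # personal preferences
posts, authors, z, f, pi = generate_posts(
    users, [world, own], n_posts=50, vocab_size=10, words_per_post=5,
    alpha=1.0, alpha_dir=[1.0, 1.0], beta=0.5, seed=7)
```

Drawing topics lazily gives the same distribution as drawing them in advance, since each topic is sampled independently of the others.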



Couplings in the MRelCRP: It is important to observe
the coupling that the MRelCRP creates between different
RelCRPs, which helps capture the interplay of various
factors for social media. (a) First, we analyze the dependencies
for a single relationship $\mathcal{R}_j$. Observe that there are $|\mathcal{U}|$
RelCRPs, one for each user (label). However, all of these $|\mathcal{U}|$
RelCRPs need not be distinct. This depends on the nature
of the relationship. For example, in the setting above, $\mathcal{R}_w$
is a 'complete' relationship. As a consequence, the neighbor
sets are the same for all users, and the world-wide RelCRP
is identical for all users. For the geographic relationship
$\mathcal{R}_g$, since the individual relations do not overlap, the geographic
RelCRP is identical for all users from the same
country. This creates one type of dependence across users.
In contrast, for the friend relationship $\mathcal{R}_n$, in general distinct
users have different sets of friends, and their RelCRPs
are distinct. However, they are still coupled, since the underlying
topics are the same, and a post by user $u$ on topic $k$
increases the count $n_k^{N(\mathcal{R}_n, u')}$ for all friends $u'$ of $u$. Thus,
for all of these three relationships, evidence can flow between
users over hyper-edges in the relationship. Finally, for the
user preference relationship $\mathcal{R}_u$, the RelCRP for each user
is distinct, and there are no dependencies. (b) Now, we analyze
the new dependencies that are created when multiple
relationships are coupled in the MRelCRP. Observe that for
$m$ relationships, there is a total of $m \times |\mathcal{U}|$ RelCRPs, $m$ for
each user, but all of these need not be distinct, as above.
The $m$ distinct RelCRPs for each user are now coupled; evidence
can flow between relationships through the users. In
the context of social media, this leads to interplay between
world-wide, geographic, network factors and personal preferences.

3.3 Dynamic Multi-relational CRP 

The two key distinguishing aspects of social media data 
are the network structure, and the dynamic nature of the 
topics and user influence patterns or personalities. The 
MRelCRP captures the network aspect, but falls short on 
the second count. Before extending our model, we first enumerate
the different aspects of the data that evolve with
time. (a) The number of topics changes as old topics die out
and new topics are born. (b) The popularity of topics changes
world-wide, in specific geographies, in sub-networks and in the
preferences of individual users. We call these topic trends.
(c) User personalities change, and users become more or less
susceptible to being influenced by world-wide, geographic,
network and individual preferences. (d) Existing topics also
evolve as new words enter the vocabulary and existing words
go out of fashion. We now propose the Dynamic Multi-relational
Chinese Restaurant Process (D-MRelCRP) that
accounts for all of these temporal evolutions. In reality, the
number of users also changes over time and the network grows
or shrinks, but we do not consider this aspect in our current
model.

We assume that the data has been segmented into epochs,
or in other words, each data element is labeled with a timestamp
that takes values from $\{1 \ldots T\}$. In practice, epochs
may be appropriately defined (e.g. hour, day, week) depending
on the application. The Dynamic MRelCRP consists
of one MRelCRP for each epoch. We introduce dependencies
between the parameters of the MRelCRPs across
epochs to capture the different aspects of temporal evolution,
as we describe next. We use additional subscripts on
parameters to indicate epochs.

Note that individual RelCRPs naturally allow the number
of topics to change. We do not need to address this
separately in the D-MRelCRP.

3.3.1 Topic Trends 

Different topics have different trends, in that some start
out being popular in certain geographies and go on to become
global hits. Some others may start as preferences of influential
individual users and evolve to regional or world favorites. To
capture this, topic popularities in our model need to change
over epochs. Since we have modeled popularity of topics
using counts, to make this approach dynamic, topic counts
of specific epochs are made dependent on those of earlier
epochs, following the approach of the Recurrent CRF [3, 1].
We extend the basic
RelCRP conditional distribution (Equation (2)) with epoch
indices as follows:

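$$P_t(z_i = k \mid z_{1:(i-1)}, u_{1:i}, \mathcal{R}, \alpha) \propto \begin{cases} n_{k,t}^{N(\mathcal{R}, u_i)} + \tilde{n}_{k,t}^{N(\mathcal{R}, u_i)} & k \le K \\ \alpha & k = K+1 \end{cases}$$

where $n_{k,t}^{N(\mathcal{R}, u)}$ is the number of neighbors of $u$ in $\mathcal{R}$ already
assigned to table $k$ in the $t$-th epoch, while $\tilde{n}_{k,t}^{N(\mathcal{R}, u)}$ captures
the historical counts in recent epochs, with exponentially
decaying weights, as follows:

$$\tilde{n}_{k,t}^{N(\mathcal{R}, u)} = \sum_{\delta \ge 1} e^{-\delta/\lambda}\, n_{k,t-\delta}^{N(\mathcal{R}, u)}$$

where $\lambda$ is the decay factor. The MRelCRP for the $t$-th epoch
is now defined using a mixture of such RelCRP conditionals
as in Equation (5).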
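The exponentially decayed historical count can be computed as in the following Python sketch. Epochs are indexed from 0 as list positions, and the optional `window` argument, which truncates the sum to the most recent epochs, is an assumption added for illustration; the function names are hypothetical.

```python
import math

def decayed_history(counts_by_epoch, t, lam, window=None):
    """Sum over delta >= 1 of exp(-delta / lam) * counts_by_epoch[t - delta].

    counts_by_epoch: neighbor counts of one topic, one entry per epoch.
    t: index of the current epoch; lam: decay factor.
    window: optional cap on how many past epochs are used (assumption).
    """
    max_delta = t if window is None else min(t, window)
    return sum(math.exp(-d / lam) * counts_by_epoch[t - d]
               for d in range(1, max_delta + 1))

def dynamic_weight(counts_by_epoch, t, lam):
    """Unnormalized weight of an existing topic in epoch t: current + history."""
    return counts_by_epoch[t] + decayed_history(counts_by_epoch, t, lam)

# A topic that was popular two epochs ago and is quiet now.
weight_now = dynamic_weight([4, 2, 0], t=2, lam=1.0)
```

Even with no posts in the current epoch, the topic keeps a positive weight, so it can be revived without being re-created as a new table.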

3.3.2 User Personality Trends 

It is natural for user personalities to be time dependent as
well. A user may become more susceptible to the influence of
her friends and deviate from her earlier personal preferences.
In the MRelCRP framework, this corresponds to the mixture
distribution $\pi_u$ for each user $u$ being a function of the epoch.
Recall that each $\pi_u$ is sampled iid from a Dirichlet prior
$\mathrm{Dir}(\alpha_w, \alpha_u, \alpha_n, \alpha_g)$. We introduce a temporal dependence
by adding a dynamic component to the prior parameter, in
the spirit of [1], as follows:

$$\pi_{u,t} \sim \mathrm{Dir}(\alpha_w + \tilde{\alpha}_{u,w,t},\ \alpha_u + \tilde{\alpha}_{u,u,t},\ \alpha_n + \tilde{\alpha}_{u,n,t},\ \alpha_g + \tilde{\alpha}_{u,g,t})$$

$$\tilde{\alpha}_{u,f,t} = \sum_{\delta \ge 1} e^{-\delta/\lambda}\, m_{u,f,t-\delta}$$

for $f \in \{w, u, n, g\}$, and $m_{u,f,t}$ being the number of times
user $u$ was influenced by relationship $f$ in epoch $t$.
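To make the dynamic personality prior concrete, the sketch below builds the Dirichlet parameters for one user in epoch $t$ and reports the resulting prior mean. The function names, the dictionary layout of the influence counts and the 0-based epoch indexing are illustrative assumptions.

```python
import math

FACTORS = ("w", "u", "n", "g")   # world, user, network, geography

def personality_prior(base_alpha, influence_counts, t, lam):
    """Dirichlet parameters alpha_f + decayed history for one user in epoch t.

    base_alpha: dict f -> alpha_f.
    influence_counts: list over epochs of dicts f -> m_{u,f,epoch}.
    """
    params = {}
    for f in FACTORS:
        history = sum(math.exp(-d / lam) * influence_counts[t - d].get(f, 0)
                      for d in range(1, t + 1))
        params[f] = base_alpha[f] + history
    return params

def prior_mean(params):
    """Expected personality vector under Dir(params)."""
    total = sum(params.values())
    return {f: a / total for f, a in params.items()}

base = {f: 1.0 for f in FACTORS}
# In the previous epoch, ten of the user's posts were world-influenced.
params = personality_prior(base, [{"w": 10}, {}], t=1, lam=1.0)
mean = prior_mean(params)
```

After an epoch dominated by a world event, the prior for the next epoch leans towards the world-wide factor, and this bias fades as the decay takes effect.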

3.3.3 Evolving Topic Distributions 



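Existing topics also evolve, as new words enter the vocabulary
and existing words go out of fashion. To
capture this, we again introduce a temporal dependence in
the prior distribution. Specifically, each topic distribution
$\phi_{k,t}$ is now sampled from $\mathrm{Dir}(\tilde{\beta}_{k,t} + \beta)$. The element $\tilde{\beta}_{k,w,t}$
of the dynamic component $\tilde{\beta}_{k,t}$ depends on how frequently the
word $w$ in the vocabulary has been historically observed under
topic $k$ until epoch $t-1$:

$$\tilde{\beta}_{k,w,t} = \sum_{\delta \ge 1} e^{-\delta/\lambda}\, m_{k,w,t-\delta}$$

where $m_{k,w,t}$ corresponds to the number of times word $w$ is
associated with the topic $k$ in epoch $t$.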

These three dynamic dependences, introduced between the
parameters of the MRelCRPs corresponding to different epochs,
define our complete D-MRelCRP.

4. INFERENCE 

In this section, we discuss the key challenges in performing
inference for the proposed D-MRelCRP model, and present
our inference algorithm addressing these challenges. The
inference problem involves determining the posterior distribution
over the two latent variables, the topic
label $z_{i,t}$ and the influence variable $f_{i,t}$, for all posts $i$ in
all epochs $t$. The parameter estimation problem involves
finding the posterior distribution of the model parameters,
the topic distributions $\phi_k$ and the personalities $\pi_u$ of the
users. The two problems are coupled, and solving them
exactly is intractable [8]. We resort to approximate techniques
based on collapsed Gibbs sampling. However, the
traditional approach [S], where the topic and influence la- 
bels of each post are repeatedly sampled until convergence 
from the conditional distributions given all other labels, is 
infeasible for us given the size of the data. Even Sequential
Monte Carlo methods [9], which rejuvenate a few older
labels, are infeasible. We adopt the online algorithm [BJ, 
which was proposed for parametric models, and modify it 
appropriately for our model. In this approach, earlier labels 
are not revisited. This allows the algorithm to scale, at the 
expense of sub-optimal estimates at the beginning, and is 
also concordant with the online nature of social media data 
PQ. Before describing the details, we first describe the con- 
ditional distributions that are required by the algorithm. 

Conditional Distributions: In the online setting, the
distribution for the influence factor $f_i$ for the $i$-th post is
conditioned on the topic and influence labels of all earlier
posts, their user labels and the content of the current post.
For the Dynamic MRelCRP, this looks as follows:

$$P(f_i = f \mid \alpha, u_{1:i}, z_{1:i}, f_{1:(i-1)}, w_{1:i}, \mathcal{R}) \propto \left(m_{u_i,f,t} + \tilde{\alpha}_{u_i,f,t} + \alpha_f\right) \times \left(n_{k,t}^{N(\mathcal{R}_f, u_i)} + \tilde{n}_{k,t}^{N(\mathcal{R}_f, u_i)}\right) \tag{6}$$

where $\alpha = \{\alpha_w, \alpha_u, \alpha_n, \alpha_g\}$, $k = z_i$, and the counts are as defined
in Section 3.
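A minimal Python sketch of the influence-factor conditional in Equation (6) follows. All names are hypothetical, and the inputs are assumed to be precomputed counts for the current user and the current topic $z_i$. One assumption goes beyond the equation: when $z_i$ is a newly created topic, every relation has zero neighbor count, so the sketch falls back to the personality term alone, which is consistent with every RelCRP giving a new table the same weight $\alpha$.

```python
def influence_probs(m_cur, alpha_hist, alpha_prior, n_cur, n_hist):
    """Normalized version of Equation (6) for one post.

    Every argument is a dict keyed by relationship f:
      m_cur[f]       times the user was influenced by f in this epoch
      alpha_hist[f]  decayed history of those counts
      alpha_prior[f] base Dirichlet parameter alpha_f
      n_cur[f]       neighbor count of topic z_i under f in this epoch
      n_hist[f]      decayed history of that neighbor count
    """
    persona = {f: m_cur[f] + alpha_hist[f] + alpha_prior[f] for f in alpha_prior}
    weights = {f: persona[f] * (n_cur[f] + n_hist[f]) for f in alpha_prior}
    total = sum(weights.values())
    if total == 0:
        # Assumption: z_i is a brand-new topic, so only personality matters.
        weights = persona
        total = sum(weights.values())
    return {f: w / total for f, w in weights.items()}

zeros = {"w": 0.0, "u": 0.0, "n": 0.0, "g": 0.0}
ones = {"w": 1.0, "u": 1.0, "n": 1.0, "g": 1.0}
probs = influence_probs(
    m_cur={"w": 0, "u": 3, "n": 0, "g": 0}, alpha_hist=zeros, alpha_prior=ones,
    n_cur={"w": 10, "u": 2, "n": 0, "g": 5}, n_hist=zeros)
```

The example shows the two forces at work: the user has mostly followed her own preferences so far, but the topic is far more popular world-wide, so the world-wide factor ends up slightly more likely.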

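The conditional distribution for topic label $z_i$, additionally
conditioned on influence factor $f_i$, is given by:

$$P(z_i = k \mid f_i, \beta, z_{1:(i-1)}, u_{1:i}, w_{1:i}, \mathcal{R}, \alpha) \propto \begin{cases} \left(n_{k,t}^{N(\mathcal{R}_{f_i}, u_i)} + \tilde{n}_{k,t}^{N(\mathcal{R}_{f_i}, u_i)}\right) \prod_{l=1}^{|w_i|} \dfrac{n_{k,v,t} + \tilde{\beta}_{k,v,t} + \beta}{\sum_{r=1}^{V} \left(n_{k,r,t} + \tilde{\beta}_{k,r,t} + \beta\right)} & k \le K \\ \alpha \prod_{l=1}^{|w_i|} \dfrac{\beta}{\sum_{r=1}^{V} \beta} & k = K+1 \end{cases} \tag{7}$$

where $w_{il}$, the $l$-th word of post $i$, corresponds to the $v$-th word of the
vocabulary of size $V$, and
$n_{k,v,t}$ corresponds to the number of times the $v$-th word in the
vocabulary is associated with topic $k$ during epoch $t$. Note
that in online inference, the counts in the equations above correspond
to the data instances (posts) which have arrived
before the $i$-th instance. Also, the conditional distributions
for the static model (MRelCRP) can be obtained by removing
the historical counts in the above expressions, specifically
by setting $\tilde{\alpha}_{u,f,t} = \tilde{n}_{k,t}^{N(\mathcal{R}, u)} = 0$ in Equation (6), and
$\tilde{n}_{k,t}^{N(\mathcal{R}, u)} = \tilde{\beta}_{k,v,t} = 0$ in Equation (7). Similarly, the conditional
distribution for RelCRP, which has a single relationship
$\mathcal{R}$, can be obtained as a special case of the MRelCRP,
by taking counts $n_{k,t}^{N(\mathcal{R}, u)}$ with respect to $\mathcal{R}$.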
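The topic conditional can be evaluated in log space to avoid underflow for longer posts. The Python sketch below is an illustration under stated assumptions rather than the implementation used in the experiments: the prior weight of each existing topic (current plus decayed neighbor count under the selected relationship) is assumed to be precomputed, the word likelihood is a simple product over the words of the post, a new topic has all word counts equal to zero, and all names are hypothetical.

```python
import math

def topic_posterior(post_words, prior_weight, word_counts, word_hist,
                    alpha, beta, vocab_size):
    """Normalized topic probabilities for one post.

    post_words: vocabulary indices of the words in the post.
    prior_weight[k]: n + decayed n for topic k under the chosen relationship.
    word_counts[k][v]: n_{k,v,t}; word_hist[k][v]: decayed word history.
    The unopened topic is reported under the key "new".
    """
    log_w = {}
    for k, prior in prior_weight.items():
        if prior <= 0:
            continue                   # no neighbor evidence for this topic
        denom = (sum(word_counts[k].values()) + sum(word_hist[k].values())
                 + vocab_size * beta)
        lw = math.log(prior)
        for v in post_words:
            num = word_counts[k].get(v, 0) + word_hist[k].get(v, 0) + beta
            lw += math.log(num / denom)
        log_w[k] = lw
    # New topic: all counts are zero, so each word has probability 1 / V.
    log_w["new"] = math.log(alpha) + len(post_words) * math.log(1.0 / vocab_size)
    top = max(log_w.values())          # subtract the max for stability
    unnorm = {k: math.exp(lw - top) for k, lw in log_w.items()}
    total = sum(unnorm.values())
    return {k: w / total for k, w in unnorm.items()}

probs = topic_posterior(
    post_words=[0], prior_weight={0: 2.0},
    word_counts={0: {0: 3}}, word_hist={0: {}},
    alpha=1.0, beta=1.0, vocab_size=2)
```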

Parallel Inference Algorithm: A straightforward
online algorithm, that makes a single sequential pass over
the data, is infeasible considering the scale of social media
data. This necessitates a parallelized inference algorithm.
Sampling-based parallel inference algorithms for hierarchical
Bayesian models have received attention in the literature
[5, 18]. These approaches split data across threads or processors,
execute Gibbs iterations on them independently, and
finally consolidate labels across threads asynchronously at
the end of each iteration. In contrast, parallelization of
our algorithm results in independent, online updates in each
thread. Additionally, D-MRelCRP being a non-parametric
model, new topics are created by each thread, and in the
absence of repeated Gibbs iterations, are not sufficiently
consolidated. As a result, we require a synchronous architecture,
where all new topics are explicitly consolidated by
a master thread at the end of each iteration. Our multi-threaded
inference algorithm is described in Table 2.



After an initial batch phase (master thread: steps 1-4), 
the algorithm iterates over three phases: data access (master 
thread: step 6), computation (child threads: steps 2-5) and 
synchronization (master thread: steps 8-9, 11-13). The ini- 
tial batch phase is necessary to prevent creation of too many 
new topics at the beginning by different child threads. Note 
that the computation phase happens in parallel across the K 
child threads. Each child thread creates multiple new top- 
ics, whose counts are maintained locally. These counts are 
passed back to the master thread, along with other counts 
at the end of the computation phase. After receiving back 
labels from all child threads, the master thread re-samples 
labels for all posts assigned new topics by child threads. This 
helps in the consolidation of new topics, many of which may 
otherwise be quite similar. The iterations continue until all 
posts have been processed. 
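The three phases can be illustrated with a small sketch. This is not the authors' Java implementation: the function names (`draw_topic`, `child`, `process_batch`) are hypothetical, and the sampler is a toy CRP prior draw that ignores post content and influence labels, standing in for the conditional distributions derived earlier. It only shows the synchronous pattern of splitting a mini-batch, labelling chunks independently against a snapshot of the counts, waiting for all children, and re-sampling posts that were assigned thread-local new topics.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NEW = object()  # sentinel meaning "open a new topic"


def draw_topic(counts, alpha, rng):
    """Toy CRP prior draw (assumption): topic k with weight counts[k],
    a new topic with weight alpha. The real sampler would also use the
    words of the post and the relationship counts."""
    topics = list(counts)
    weights = [counts[k] for k in topics] + [alpha]
    return rng.choices(topics + [NEW], weights=weights, k=1)[0]


def child(thread_id, chunk, counts_snapshot, alpha, seed):
    """Child thread: label its chunk online against a private copy of the counts."""
    rng = random.Random(seed)
    local = Counter(counts_snapshot)
    labels = []
    for _ in chunk:
        k = draw_topic(local, alpha, rng)
        if k is NEW:
            k = (thread_id, len(labels))  # temporary, thread-local topic id
        local[k] += 1
        labels.append(k)
    return labels


def process_batch(posts, global_counts, num_children, alpha, seed=0):
    """Master thread: split the batch, run children, wait, consolidate new topics."""
    size = -(-len(posts) // num_children)  # ceiling division
    chunks = [posts[j * size:(j + 1) * size] for j in range(num_children)]
    snapshot = dict(global_counts)
    with ThreadPoolExecutor(max_workers=num_children) as pool:
        futures = [pool.submit(child, j, chunks[j], snapshot, alpha, seed + j)
                   for j in range(num_children)]
        per_child = [f.result() for f in futures]  # synchronous barrier
    labels = [z for child_labels in per_child for z in child_labels]

    # Posts that kept an existing topic update the global counts directly.
    pending = []
    for i, z in enumerate(labels):
        if isinstance(z, tuple):
            pending.append(i)
        else:
            global_counts[z] += 1

    # Posts given a thread-local new topic are re-sampled against the merged
    # counts, so that similar new topics from different threads can collapse.
    rng = random.Random(seed + num_children)
    next_id = max(global_counts, default=-1) + 1
    for i in pending:
        k = draw_topic(global_counts, alpha, rng)
        if k is NEW:
            k, next_id = next_id, next_id + 1
        global_counts[k] += 1
        labels[i] = k
    return labels
```

Calling `process_batch(posts, counts, num_children=7, alpha=1.0)` once per mini-batch mirrors the outer loop of the master thread. Note that CPython threads do not run CPU-bound code in parallel, so the sketch shows the control flow rather than the speed-up.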

5. EXPERIMENTAL RESULTS 

In this section we discuss in detail the experiments that 
we carried out using the proposed D-MRelCRP model on 
multiple large real social media datasets. We evaluate the 
following aspects of the model: 

(a) Model goodness: Ability to explain unseen data 

(b) Topics and topic labels: Our inference and learning algorithms
assign a topic to each post, and also find a distribution over words
for each topic. We evaluate both aspects qualitatively and
quantitatively.

(c) User personalities and their trends: The major distinc- 
tive feature of our model is the influence label associated 
with each post. Using this label, we are able to estimate the 
user personalities, or the susceptibility of the user to various 
influencing factors, and their dynamics. We discuss various 
insights that we were able to find from personality trends. 

(d) Scalability: One of the main strengths of our inference 
algorithm is the ability to scale to hundreds of millions of 
data samples. We evaluate how the running time of our 
multi-threaded implementation scales with data size. 

(e) Relative importance of factors: The MRelCRP and D-MRelCRP models
are able to combine various influence factors and their dynamics. We
analyze the usefulness of the different factors for social media
analysis.

We would like to point out that no other single model is able to
perform such a wide array of tasks in social media analysis. Wherever
possible, we make use of available ground truth, or surrogates of it,
for quantitative evaluation and compare against the best available
baselines. However, for our main contribution, discovering user
personality trends, no existing algorithm can perform this task.

Datasets: We carried out all our experiments on two 
different datasets: (1) Twitter: a collection of 360 million 
tweets crawled between June and December 2009, and (2) 
Facebook: a collection of about 300,000 posts obtained by 
extracting feeds from publicly available profiles over a span 
of three months. 

Default Parameter Settings: The hyper-parameters of our online Gibbs
sampler were initialized as $\alpha = 0.1/(K+1)$ and $\beta = 0.1$.

Baselines: We compare the performance of our models against the
following state-of-the-art models that have been shown to be effective
for analyzing microblogs: (a) Latent Dirichlet Allocation (LDA) [8],
(b) Labeled LDA [17], and (c) Timeline [3]. Labeled LDA is not
generally applicable since it makes use of hashtags assigned to the
posts to identify topic labels. While this meta-information is
available on Twitter, Facebook does not support it. The Timeline model
is the closest to our model in that it is a non-parametric topic model
that also captures topic dynamics.

5.1 Model Goodness 

Goodness of a model is evaluated by how well it is able to fit
previously unseen data. Perplexity is commonly used to measure the
generalization ability of topic models [8]. It is defined as the
inverse of the geometric mean per-word likelihood:

$$\mathrm{Per}(D_t) = \exp\left\{ -\frac{\sum_{d=1}^{|D_t|} \log P(w_d)}{\sum_{d=1}^{|D_t|} N_d} \right\},$$

where $N_d$ is the number of words in the $d$-th post in the held-out
test set $D_t$, and $\log P(w_d)$ is its log-likelihood. Lower values
of perplexity indicate better generalization ability.
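As a quick illustration of the definition, the following sketch computes perplexity from per-post log-likelihoods and post lengths. The function name and the input format are assumptions made for the example; how $\log P(w_d)$ is obtained depends on the model being evaluated.

```python
import math


def perplexity(log_likelihoods, lengths):
    """Perplexity of a held-out set.

    log_likelihoods[d] is log P(w_d) for post d (natural log),
    lengths[d] is N_d, the number of words in post d.
    """
    if len(log_likelihoods) != len(lengths):
        raise ValueError("one log-likelihood and one length per post")
    total_words = sum(lengths)
    if total_words == 0:
        raise ValueError("held-out set contains no words")
    return math.exp(-sum(log_likelihoods) / total_words)
```

For a model that assigns every word a uniform probability of 1/V, the perplexity equals the vocabulary size V, which is a useful sanity check when comparing values such as those in Table 3.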

We consider two different datasets for this experiment. For each model
under consideration, we first train it on Twitter data, and then use as
test set a sample of 8 million tweets from the last month in our
dataset. Similarly, each model is trained on Facebook data and
evaluated on a sample of 40K posts from the last month's activity.
Perplexities of the various models are recorded in Table 3. Note that,
unlike our model, LDA requires the number of topics to be specified. We
set it to the average number of topics discovered by our model across
epochs. Labeled LDA cannot be applied to Facebook data, since it
requires hashtags. It can be seen that D-MRelCRP has the lowest
perplexity in both cases. Among the baselines, Timeline has the best
perplexity. This demonstrates that capturing both temporal evolution
and relationships is important for explaining future data.

5.2 Quality of topics 

The D-MRelCRP model assigns a topic label to each post, 
indicating its category, and also finds a distribution over vo- 
cabulary words for each unique topic label, indicating the 
semantics of the topic. Our hypothesis is that by modeling 
the different influences on the users, D-MRelCRP is able to 



Table 2: Parallel Inference Algorithm

Master Thread
1. Read first N posts
2. Iterate t times
3. For each post i, sample z_i, f_i
4. Update joint counts
5. Iterate until no new post
6. Read next N posts
7. For child thread j = 1 to K
8. Send posts j(N/K) to (j+1)(N/K), joint counts
9. Receive labels {z_i, f_i} for N/K posts
10. Wait until child threads complete
11. Iterate t times
12. For each post with new label z_i, sample z_i, f_i
13. Update joint counts

Child Thread
1. Sleep until invoked by Master Thread
2. Receive N/K posts, joint counts
3. Iterate t times
4. For each post i, sample z_i, f_i
5. Return N/K labels {z_i, f_i} to Master Thread



Table 3: Perplexity and Clustering Accuracy.

| Model | Perplexity (Twitter) | Perplexity (Facebook) | Clustering Acc. (Twitter): nMI | R-Index | F1 |
|---|---|---|---|---|---|
| DMRelCRP | 1188.29 | 1562.34 | 0.93 | 0.88 | 0.86 |
| Timeline | 1582.86 | 1802.9 | 0.81 | 0.72 | 0.73 |
| L-LDA | 1982.76 | NA | 1 | 1 | 1 |
| LDA | 2932.06 | 3602.0 | 0.55 | 0.52 | 0.48 |



better identify topics. To evaluate this, we check topic qual- 
ity in different ways. We directly evaluate the topic labels 
of posts by comparing against a reasonable gold-standard. 
Then, we indirectly evaluate the topic labels of posts by us- 
ing them as features in two prediction tasks. Finally, we 
identify significant topic trends and compare them qualita- 
tively with world knowledge. We provide more details on 
these three evaluations next. 

Clustering posts using topic labels. 

In our proposed models, there is a single topic label associated with
each post. This results in a hard clustering of the posts according to
topics. Therefore, one way to evaluate the topic assignment quality is
to evaluate the clustering accuracy. Gold-standard clusters of posts
are typically hard to obtain. As an alternative, in the case of Twitter
data, we consider hashtags as cluster indicators. Since it is well
known that hashtags are often poor indicators of post clusters, we
consider only a few authoritative hashtags, as follows. We collected
~16K posts with hashtags corresponding to highly specific topics, such
as #NIPS2009, #ICML2009, #bollywood, #hollywood, #www2009, etc. We
consider this as the test set with gold-standard labels for evaluation.

We use three standard metrics to evaluate clustering accuracy:
normalized mutual information, Rand index and F-measure. In Table 3, we
record the performance of our D-MRelCRP model and those of the three
baselines on the Twitter dataset. Not surprisingly, Labeled LDA
correctly identifies the clusters all the time, by virtue of taking
hashtags as inputs. DMRelCRP comes close, in spite of not using
knowledge of hashtags at all, and performs better than all other models
across all three evaluation metrics. Further, on closer inspection, we
found that the Labeled LDA clustering is not as good as the numbers
indicate, and the two proposed models are often better. For example,
DMRelCRP splits the ~3K posts corresponding to the #movies hashtag into
two topics, and separates out posts originating from India. Comparison
using KL-divergence shows this topic to be very similar to the
#bollywood hashtag. The #sports hashtag shows a similar split. Such
fine-grained distinction is not possible for Labeled LDA or Timeline,
which do not capture geographic or other influencing factors.
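For readers who want to reproduce this kind of comparison, the three metrics can be computed from a list of gold labels (here, hashtags) and a list of predicted topic labels as sketched below. The paper does not state which variants were used, so two choices are assumptions: the F-measure is taken over pairs of posts, and the mutual information is normalized by the geometric mean of the two entropies. The quadratic pair enumeration is only practical for small samples.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(gold, pred):
    """Count post pairs by whether they share a gold cluster and a predicted cluster."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = pred[i] == pred[j]
        if same_gold and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(gold, pred):
    """Fraction of post pairs on which the two clusterings agree."""
    tp, fp, fn, tn = pair_counts(gold, pred)
    return (tp + tn) / (tp + fp + fn + tn)


def pairwise_f1(gold, pred):
    """Pairwise F-measure (assumed variant): harmonic mean of pair precision and recall."""
    tp, fp, fn, _ = pair_counts(gold, pred)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def nmi(gold, pred):
    """Mutual information normalized by sqrt(H(gold) * H(pred)) (assumed variant)."""
    n = len(gold)
    gold_sizes, pred_sizes = Counter(gold), Counter(pred)
    joint = Counter(zip(gold, pred))
    mi = sum((c / n) * math.log(c * n / (gold_sizes[g] * pred_sizes[p]))
             for (g, p), c in joint.items())
    h_gold = -sum((c / n) * math.log(c / n) for c in gold_sizes.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in pred_sizes.values())
    denom = math.sqrt(h_gold * h_pred)
    # If either clustering has a single cluster, its entropy is 0; report 0.
    return mi / denom if denom > 0 else 0.0
```

A clustering that matches the hashtags up to renaming of cluster ids scores 1 on all three metrics, which is the situation of Labeled LDA in Table 3.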

Prediction Tasks. 

Since the cluster gold-standards for posts are unreliable 
for Twitter, and unavailable for Facebook, we additionally 
perform the following indirect evaluation of topic assignment 
to posts. Topic labels are commonly used as reliable low- 
dimensional features for learning classifiers [Hj. We use the 
topic labels for posts for two representative prediction tasks 
in social media with reliable gold-standards: predicting post 
authorship and predicting commenting activity. 

Table 4: Prediction Task Accuracies.

| Model | Authorship Prediction (Twitter) | Authorship Prediction (Facebook) | Commenting Prediction (Twitter) | Commenting Prediction (Facebook) |
|---|---|---|---|---|
| DMRelCRP | 0.793 | 0.734 | 0.683 | 0.648 |
| Timeline | 0.718 | 0.669 | 0.582 | 0.579 |
| LDA | 0.521 | 0.432 | 0.429 | 0.482 |
| L-LDA | 0.647 | NA | 0.542 | NA |

Predicting Authorship: Given a post p and user u, this task is to
predict if user u is the author of post p. We construct a Twitter test
set having 20M tweets from the last 15 days, and a Facebook test set
having 40K posts from the last month. For each user, we create training
sets for Twitter and Facebook by including as positives all posts
authored by that user, and as negatives an equal-sized random sample
from posts authored by other users in the recent past. As features, we
use the topic label of the post inferred by a specific algorithm, and
the time-stamp. We use a k-nearest neighbor classifier (k=5), where we
consider the minimum distance between the post p and the posts authored
by u, with KL-divergence as the topic distance and the number of
in-between days as the time difference.
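The topic distance used by the classifier can be sketched as follows. The function computes the KL-divergence between the word distributions of two topics, given as lists of probabilities over the same vocabulary. The `eps` floor is an assumption added for the example, to avoid division by zero when the second distribution has zero entries; how the paper combines this distance with the time difference is not specified and is not modelled here.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Terms with p[i] == 0 contribute nothing; q[i] is floored at eps
    (a hypothetical smoothing constant) to keep the logarithm finite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same vocabulary")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total
```

KL-divergence is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ; a nearest-neighbor search needs to fix one direction and use it consistently.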

Predicting Commenting Activity: Given a post p by some user v, the task
is to predict if user u comments on the post. We similarly construct
test and training sets from Twitter and Facebook. As an additional
feature, we consider the number of past interactions between users u
and v. We again use a k-NN classifier (k=5) for prediction.

The accuracies for both prediction tasks for the different algorithms
are recorded in Table 4. It can be seen that DMRelCRP performs
significantly better on both datasets. The standard topic model
baselines, and also Timeline, do not perform very well on this task.
This shows the usefulness of topics inferred by considering both
relationships and temporal evolution. Labeled LDA performs better than
LDA but, in spite of using hashtags, is significantly outperformed by
our proposed approaches.

In summary, the topics inferred using our model are signif- 
icantly more useful for prediction tasks involving users and 
posts compared to state of the art topic models. 

Topic Trending and Major Event Detection. 

The inferred topic label for each post, in conjunction with the user
label, can be used to identify various topic trends. From the joint
counts $n_{k,u,t}$ of the number of posts by user $u$ at time epoch $t$
on topic $k$, the probability $p_{k,u,t}$ of user $u$ posting on topic
$k$ at epoch $t$ can be estimated by normalization. By subsequent
aggregation over subsets of users, popularities of different topics
across different user segments can be plotted against epochs. When a
particular topic $k$ dominates over all others in an epoch, we flag
that topic as a major event, and analyze it using the dominant words in
the topic distribution $\phi_k$. We were able to identify several
break-out events using DMRelCRP topic labels, as we describe below.

World-wide Events: World-wide popularity of a topic $k$ at epoch $t$ is
estimated by aggregating $p_{k,u,t}$ over all users $u$. The major
world-wide events discovered by D-MRelCRP include the demise of Michael
Jackson (Jun 30) (top words: mj, michael, dead, singer, jackson, pop),
the FIFA World Cup (Sep 15-30) (football, soccer, fifa, worldcup) and
the launch of Google Wave (Dec 1-15) (wave, invite, google, launching).

Geographic Events: The popularity of a topic $k$ in a specific
geography at epoch $t$ is estimated by aggregating $p_{k,u,t}$ only
over users $u$ in that geography. Jeff Goldblum's demise (Jul 1-15)
(death, jeff, actor, goldblum, dies, end, era) was detected as a major
event in Australia and the UK.

We were able to verify these world-wide and geographic major events
using Google Insights (see footnote 2). The words from the specific
topics appeared in the top searches during these specific intervals,
world-wide or in the specific geographies. We were similarly able to
find major events for small networks of users (e.g. the official page
for Microsoft on Twitter, @MSFTNews, and its followed pages) and for
important individual users (such as @ICML2009). In summary, DMRelCRP
enables us to discover interesting topic trends and major events at
different levels of granularity.
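The trend computation described in this subsection can be sketched in a few lines. Several details are assumptions made for the example: the counts are normalized over topics for each user and epoch, aggregation over a user subset is a plain sum, and a topic is flagged as a major event when its share of the aggregated popularity reaches a hypothetical threshold `min_share`. The function names are illustrative.

```python
from collections import defaultdict


def topic_popularity(counts, users=None):
    """Aggregate p_{k,u,t} over a user subset.

    counts maps (topic k, user u, epoch t) to the number of posts.
    Returns {epoch: {topic: popularity}}; users=None means all users.
    """
    user_totals = defaultdict(float)
    for (k, u, t), n in counts.items():
        user_totals[(u, t)] += n
    popularity = defaultdict(lambda: defaultdict(float))
    for (k, u, t), n in counts.items():
        if users is None or u in users:
            # p_{k,u,t}: share of user u's posts in epoch t that are on topic k
            popularity[t][k] += n / user_totals[(u, t)]
    return popularity


def major_events(popularity, min_share=0.5):
    """Flag, per epoch, the top topic if it holds at least min_share of the mass."""
    events = {}
    for t, scores in popularity.items():
        total = sum(scores.values())
        top = max(scores, key=scores.get)
        if scores[top] / total >= min_share:
            events[t] = top
    return events
```

Passing the set of users located in one country as `users` gives the geographic variant; passing the followers of a single account gives the small-network variant mentioned above.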

5.3 Analysis of Influences 

The distinctive aspect of our model is the label $f_i$ indicating the
influencing factor behind the $i$-th post. It is difficult to evaluate
the accuracy of these inferred factors directly. Instead, we focus on
the aggregate analysis that can be performed using this label, and the
rich insights that we were able to unearth using it.

Using the joint counts $n_{k,u,f,t}$ of topics ($k$), users ($u$) and
influence factors ($f$) in each epoch ($t$), we can estimate the
probability $p_{k,u,f,t}$ of a specific user $u$ posting on a topic $k$
after being influenced by a factor $f$ in epoch $t$, by normalizing
appropriately. On aggregating over topics $k$, the distribution
$p_{u,f,t}$ (corresponding to model parameter $\pi_{u,t}$) over factors
$f$ indicates the personality of the user $u$ at epoch $t$. Plotting
these distributions over epochs $t$ shows the personality trend for
user $u$. Since trends over individual anonymous users are not
insightful, below we plot aggregate trends over different interesting
user subsets. For this, we use heat-maps, where the matrix rows
correspond to influence factors, columns to epochs, and hotter colors
indicate higher probability values.
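A minimal sketch of how such a heat-map matrix could be assembled from the joint counts is given below. The exact normalization and aggregation are not spelled out in the text, so the example assumes that counts are normalized per user and epoch, and that the aggregate over a user subset is the average over users who posted in that epoch. The function name and input format are illustrative.

```python
from collections import defaultdict


def personality_trend(counts, users):
    """Average p_{u,f,t} over a user subset.

    counts maps (topic k, user u, factor f, epoch t) to a post count.
    Returns {factor: {epoch: value}}, i.e. the rows and columns of the heat-map.
    """
    totals = defaultdict(float)     # (u, t) -> all posts by u in epoch t
    by_factor = defaultdict(float)  # (u, f, t) -> posts by u in t attributed to f
    for (k, u, f, t), n in counts.items():
        if u in users:
            totals[(u, t)] += n
            by_factor[(u, f, t)] += n  # summing over topics k

    active = defaultdict(int)       # epoch -> number of users with posts
    for (u, t) in totals:
        active[t] += 1

    heat = defaultdict(lambda: defaultdict(float))
    for (u, f, t), n in by_factor.items():
        heat[f][t] += (n / totals[(u, t)]) / active[t]
    return heat
```

With this normalization each column of the heat-map sums to one, so a surge in one factor necessarily comes at the expense of the others.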

World-wide Personality Trends: First, we aggregate $p_{u,f,t}$ over all
users to estimate the world-wide susceptibility of users to specific
factors at a specific epoch (15-day period). This trend is shown in the
heat-map of Figure 1 (best viewed in color). The positions of the
hotter colors and the color gradients are of interest. We can observe
that the world-wide factor has the largest variance, followed by
personal preferences, while the other trends are largely flat. Also, we
can see that surges in world-wide influence happen mostly at the
expense of personal preference. The largest such surge happens around
Jun 30. This is when the news of Michael Jackson's death broke on
Twitter, and we can see that users discarded their personal preferences
and posted about this event. The strength of world-wide influence then
subsides gradually, and users return to their personal preferences. We
can see that world-wide influence rises again around Sep 15 and Dec 1,
again at the expense of personal preference. The most popular topics at
these times were the FIFA World Cup and Google Wave. In summary, users
are usually influenced mostly by their personal preferences and friend
network, apart from times of significant world events.

Personality trends in specific geographies: Next, we aggregate
$p_{u,f,t}$ over users in specific geographies to estimate the
susceptibility of users to specific factors in different parts of the
world. Personality trends for 5 different geographies (USA, UK,
Australia, China and India) are shown using heat-maps in Figure 2. We
can see many interesting patterns. The personality trends in USA, UK
and Australia are largely similar, apart from the geographic
influences, which are high at different epochs. For USA, one such high
occurs around Sep 15, when the US Open is a dominating topic. For UK,
we can see a high around Aug 15 (Football, Premier League) and for
Australia around Jul 1 (Jeff Goldblum's demise). For India, the
relative strengths of world-wide and geographic influences are somewhat
weaker. For China, the pattern looks different. The strengths of the
various influences stay relatively stable, and geographic influence is
much stronger than in the other 4 cases.

2 http://www.google.com/insights/search/

Figure 1: World-wide personality trends

Figure 2: Personality trends in specific geos

Topic Character Trends: As a final example of the variety of analysis
that D-MRelCRP can perform, we look at trends in topic characters. By
aggregating $p_{k,u,f,t}$ over all users and then using Bayes rule, we
can find the posterior distribution $p_{f|k,t}$ over different
influence factors for each topic $k$ at epoch $t$. By plotting this
over epochs, we can see how a topic changes its 'character', and moves
from a 'geographic' topic to a 'world-wide' topic, for example. We
illustrate this in Figure 3 (best viewed in color), using 3 topics.
Japan Earthquake evolved from a geographic topic to a world-wide topic,
Google Wave from a personal preference topic to a world-wide topic, and
Tiger Woods from a personal preference topic, to a geographic topic,
and finally a world-wide topic.
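The Bayes-rule step can be written out explicitly. The following is one way to spell it out, assuming that aggregation over users is a plain sum and writing $p_{k,f,t}$ for the aggregated quantity (a symbol introduced here for convenience, not taken from the model definition):

```latex
p_{k,f,t} = \sum_{u} p_{k,u,f,t},
\qquad
p_{f \mid k,t}
  = \frac{p_{k,f,t}}{\sum_{f'} p_{k,f',t}}
  = \frac{\sum_{u} p_{k,u,f,t}}{\sum_{f'} \sum_{u} p_{k,u,f',t}} .
```

For each topic $k$ and epoch $t$ the values $p_{f \mid k,t}$ sum to one over the factors $f$, so a topic's 'character' at an epoch is the factor carrying most of this mass.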

In summary, DMRelCRP enables a wide variety of analyses of influences,
leading to many interesting insights, beyond the capability of existing
models.

5.4 Other Experiments 

In our experiments, we employed a Java-based multi-threaded framework
on an 8-core machine with 32 GB RAM. We employed K = 7 child threads,
read N = 35K posts in a mini-batch, and used t = 100 Gibbs iterations
per batch. In Figure 4, we plot the time taken (in microseconds) to
process one post by the multi-threaded version and a sequential
version, after having processed N posts. This time increases as the
number of living topics increases with N. The two plots clearly
demonstrate the superior scalability of our multi-threaded inference
algorithm.

Figure 3: Character trends of 3 topics

Figure 4: Scalability of inference algorithms

Finally, in Table 5, we analyze the contributions of the different
aspects to DMRelCRP's final performance. We can see that the model
improves (both in terms of perplexity and prediction accuracy) through
the addition of more relationships and the interplay between them,
which is the main strength of the model.

Table 5: Importance of Model Factors. $R_u$ corresponds to user
preferences; $R_w$, $R_n$ and $R_g$ correspond to world-wide,
friend-network and geographic factors, respectively.

6. CONCLUSIONS

In this paper, we have made a first attempt at studying the important
problem of analyzing user influences in the generation of social media
data. We have proposed a new non-parametric model called Dynamic
Multi-Relational CRP that incorporates the aggregated influence of
multiple relationships into the data generation process, as well as
dynamic evolution of model parameters, to capture the essence of social
network data. Our multi-threaded online inference algorithm allowed us
to analyze a collection of 360 million tweets. Through extensive
evaluations, we demonstrated that the topic trends discovered by our
model are superior to those from state-of-the-art baselines. More
importantly, we found insightful patterns of influence on social
network users, beyond the capability of existing models.